ABSTRACT

We  describe  a  three-tiered  approach  for  evaluation  of  spoken 
dialogue  systems.  The  three  tiers  measure  user  satisfaction, 
system  support  of  mission  success  and  component  performance. 
We  describe  our  use  of  this  approach  in  numerous  fielded  user 
studies  conducted  with  the  U.S.  military. 
1.  INTRODUCTION

Evaluation of spoken language systems is complicated by the
need to balance distinct goals. For collaboration with others
in the speech technology community, metrics must be generic
enough for comparison to analogous systems. For project
management and business purposes, metrics must be specific
enough to demonstrate end-user utility and improvement over
other approaches to a problem.

Since 1998, we have developed a spoken language dialogue
technology called Listen-Communicate-Show (LCS) and
applied it to demonstration systems for U.S. Marines logistics,
U.S. Army test data collection, and commercial travel
reservations. Our focus is the transition of spoken dialogue
technology to military operations. We support military users in
a wide range of tasks under diverse conditions. Therefore, our
definition of success for LCS is operational success. It must
reflect the real world success of our military users in performing
their tasks. In addition, for our systems to be considered
successful, they must be widely usable and easy for all users to
operate with minimal training. Our evaluation methodology must
model these objectives.

With these goals in mind, we have developed a three-tier
metric system for evaluating spoken language system
effectiveness. The three tiers measure (1) user satisfaction,
(2) system support of mission success and (3) component
performance.


2.  THE  THREE-TIERED  APPROACH 

Our three-tier metric scheme evaluates multiple aspects of LCS
system effectiveness. User satisfaction is a set of subjective
measures that introduces user perceptions into the assessment
of the system. System support of mission success measures
overall system performance with respect to our definition of
success. Component performance scores each individual system
component’s role in overall system success.

Collection of user input is essential in evaluation for two
reasons. First, it is necessary to consider the user perspective
during evaluation to achieve a better understanding of user needs.
Second, user preference can influence the interpretation of
mission success and component performance measurements.
Mission success and component performance often trade off
against each other, with inefficient systems producing higher
success scores. Since some users are willing to overlook
efficiency for guaranteed performance while others opt for
efficiency, our collection of user input helps determine the
relative importance of these aspects.

Mission success is difficult to quantify because it is defined
differently by users with different needs. Therefore, it is
essential to establish a definition of mission success early in
the evaluation process. For our applications, we derive this
definition from domain knowledge acquisition with potential users.

It is important to evaluate components individually since
component evaluations reveal distinctive component flaws. These
flaws can negatively impact mission success because
catastrophic failure of a component can prevent the completion of
tasks. For example, in the Marine logistics domain, if the
system fails to recognize the user signing onto the radio
network, it will ignore all subsequent utterances until the user
successfully logs on. If the recognition of sign-on completely
fails, then no tasks can be completed. In addition, periodic
evaluation of component performance focuses attention on
difficult problems and possible solutions to these problems
[1].

3.  EVALUATION  METRICS 

At  the  top  level  of  our  approach,  measurements  of  overall  user 
satisfaction  are  derived  from  a  collection  of  user  reactions  on  a 
Likert-scaled  questionnaire.  The  questions  are  associated  with 
eight  user  satisfaction  metrics:  ease  of  use,  system  response, 
system  understanding,  user  expertise,  task  ease,  response  time, 
expected  behavior  and  future  use.  We  have  categorized  our  user 
satisfaction  questions  in  terms  of  specific  metrics  as  per  the 
PARADISE  methodology  [5,  2].  These  metrics  are  detailed  in 
Table  1. 
Table  1.  User  Satisfaction  metrics


Metric                Description                         Example Likert Survey Question
--------------------  ----------------------------------  -----------------------------------
Ease of Use           User perception of ease of          The system was easy to use
                      interaction with overall system

System Response       Clarity of system response          System responses were clear and
                                                          easy to understand

System Understanding  System comprehension of the user    The system understood what you said

User Expertise        Shows us how prepared the user      You knew how to interact with the
                      felt due to our training            system based on previous experience
                                                          or training

Task Ease             User ease in performing a given     It was easy to make a request
                      task

Response Time         User’s impression of the speed of   The system responded to you in a
                      system’s reply                      timely manner

Expected Behavior     Connection between the user’s       The system worked the way that you
                      experience and preconceived         expected it to
                      notions

Future Use            Determination of overall            You would use a mature system of
                      acceptance of this type of system   this type in the future
                      in the future
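
As a minimal sketch of how questionnaire responses might be rolled up
into per-metric and overall satisfaction scores, the Python below
averages Likert scores by metric. The five-point scale, the question
identifiers, and the question-to-metric mapping are illustrative
assumptions, not the survey actually administered.

```python
from statistics import mean

# Hypothetical mapping from survey question IDs to user satisfaction metrics.
QUESTION_METRIC = {
    "q1": "Ease of Use",
    "q2": "System Response",
    "q3": "System Understanding",
    "q4": "Response Time",
}


def satisfaction_scores(responses):
    """Average Likert scores per metric across all participants.

    `responses` is a list of dicts, one per participant, mapping a
    question ID to a score on an assumed 1-5 scale (5 = strongly agree).
    """
    by_metric = {}
    for answers in responses:
        for question, score in answers.items():
            by_metric.setdefault(QUESTION_METRIC[question], []).append(score)
    per_metric = {metric: mean(scores) for metric, scores in by_metric.items()}
    # Overall satisfaction here is the unweighted mean of the metric means.
    overall = mean(per_metric.values())
    return per_metric, overall


# Made-up responses from two participants.
responses = [
    {"q1": 4, "q2": 3, "q3": 2, "q4": 3},
    {"q1": 5, "q2": 3, "q3": 4, "q4": 1},
]
per_metric, overall = satisfaction_scores(responses)
```

Weighting every metric equally is a design choice of this sketch; a
study could just as well weight metrics by their importance to users.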

The middle tier metrics measure the ability of users to
successfully complete their domain tasks in a timely manner.
Success, in this case, is defined as completion of a task and
segments of the task utilizing the information supplied by the
user. A task is considered successful if the system was able to
comprehend and process the user’s request correctly. It is
important to determine if success was achieved and at what
cost. The user’s ability to make a request in a reasonable
amount of time with little repetition is also significant. The
mission success metrics fall under nine categories: task
completion, task complexity, dialogue complexity, task efficiency,
dialogue efficiency, task pace, dialogue pace, user frustration
and intervention rate.

For these metrics, we consider the tasks the user is trying to
accomplish and the dialogue that the user has with the
system to accomplish those tasks. A session is a continuous
period of user interaction with the spoken dialogue system. A
session can be examined from two perspectives, task and
dialogue, as shown in Figure 1. Segments are atomic operations
performed within a task. The success rate of each segment is an
important part of the analysis of the system, while the success
rate of each task is essential for the comprehensive evaluation
of the system. For example, a task of ordering supplies in the
Marine logistics domain includes segments of signing onto the
radio network, starting the request form, filling in items a
through h, submitting the form and signing off the network.
Each segment receives an individual score for successful
completion. The Task Completion metric consists of success
scores for the overall task and the segments of the task.

Dialogue is the collection of utterances spoken to accomplish
the given task. It is necessary to evaluate Dialogue Efficiency
to achieve an understanding of how complex the user’s
dialogue is for the associated task. A turn is one user utterance,
a step in accomplishing the task through dialogue. Concepts are
atomic bits of information conveyed in a dialogue. For example,
if the user’s utterance consists of delivery time and delivery
location for a particular Marine logistics request, the time and
location are the concepts of that turn. These metrics are
described in greater detail in Table 2.
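
The session, task, segment, turn and concept terms defined above can
be sketched as a small data model. The class and field names, the
sample utterance, and the rule that a task succeeds only when every
segment succeeds are assumptions made for illustration; they are not
taken from the LCS implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """An atomic operation within a task, e.g. signing onto the radio network."""
    name: str
    succeeded: bool


@dataclass
class Turn:
    """One user utterance, carrying zero or more concepts."""
    text: str
    concepts: List[str] = field(default_factory=list)


@dataclass
class Task:
    """The task view (segments) and dialogue view (turns) of one user goal."""
    name: str
    segments: List[Segment] = field(default_factory=list)
    turns: List[Turn] = field(default_factory=list)

    def segment_success_rate(self) -> float:
        """Fraction of this task's segments that completed successfully."""
        if not self.segments:
            return 0.0
        return sum(s.succeeded for s in self.segments) / len(self.segments)

    def succeeded(self) -> bool:
        """Assumption: a task succeeds only if every segment succeeds."""
        return bool(self.segments) and all(s.succeeded for s in self.segments)


@dataclass
class Session:
    """A continuous period of user interaction with the system."""
    tasks: List[Task] = field(default_factory=list)


# Hypothetical supply-ordering task with one failed segment.
order = Task(
    name="order supplies",
    segments=[
        Segment("sign on", True),
        Segment("start request form", True),
        Segment("fill in items a-h", False),
        Segment("submit form", True),
        Segment("sign off", True),
    ],
    turns=[Turn("deliver at 0600 to the north gate",
                ["delivery time", "delivery location"])],
)
session = Session(tasks=[order])
```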


Figure  1.  Structural  Hierarchy  of  a  Spoken  Dialogue 
System  Session 


The lowest level tier measures the effectiveness of individual
system components along specific dimensions, including
component error rates. Overall system level success is determined
by how well each component accomplishes its responsibility.
This concerns measurements such as word accuracy, utterance
accuracy, concept accuracy, component speed, processing
errors, and language errors. These measurements aid system
developers by emphasizing component weakness. Component
Performance metrics also offer explanations for other metrics.
For example, bottlenecks within a component may be
responsible for slow system response time. Another example
concerns recognition accuracy. Poor word accuracy may
account for low scores of task completion and user satisfaction
with the system.
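
Word accuracy is conventionally derived from a minimum-edit-distance
alignment of the recognizer hypothesis against a reference transcript.
The sketch below is a simplified stand-in for a dedicated scoring tool:
it counts substitutions, deletions and insertions with equal cost and
does no text normalization. The function name and sample utterances are
hypothetical.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / len(ref)


# One reference word ("zero") is missing from the hypothesis.
accuracy = word_accuracy("request fuel delivery at zero six hundred",
                         "request fuel delivery at six hundred")
```

Because insertions count against the score, this measure can fall
below zero when the hypothesis is much longer than the reference.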


Table  2.  Mission  metrics 


Metric                Description                         Measurement
--------------------  ----------------------------------  ------------------------------------
Task Completion       Success rate of a given task        Σ correct segments / Σ items

Task Complexity       Ideal minimal information required  Σ ideal concepts / task
                      to accomplish a task

Dialogue Complexity   Ideal amount of interaction with    Σ ideal turns / task
                      the system necessary to complete a
                      task

Task Efficiency       Amount of extraneous information    Σ ideal concepts / Σ actual concepts
                      in dialogue

Dialogue Efficiency   Number of extraneous turns in       Σ ideal turns / Σ actual turns
                      dialogue

Task Pace             Real world time spent entering      Σ elapsed time / task complexity
                      information into the system to
                      accomplish the task

Dialogue Pace         Actual amount of system             Σ turns / task complexity
                      interaction spent entering
                      segments of a task

User Frustration      Ratio of repairs and repeats to     Σ (rephrases + repeats) /
                      useful turns                        Σ relevant turns

Intervention Rate     How often the user needs help to    Σ (user questions + moderator
                      use the system                      corrections + system crashes)
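
To make the ratio-style measurements of Table 2 concrete, the sketch
below computes several of them for a single task, reading each
measurement as a sum in the numerator over a sum (or the task
complexity) in the denominator. The record layout, field names and
sample counts are hypothetical; in practice the counts would come from
annotated logs.

```python
from dataclasses import dataclass


@dataclass
class TaskLog:
    """Hypothetical per-task counts taken from an annotated log."""
    ideal_concepts: int
    actual_concepts: int
    ideal_turns: int
    actual_turns: int
    elapsed_seconds: float
    rephrases: int
    repeats: int
    relevant_turns: int


def mission_metrics(log: TaskLog) -> dict:
    """Compute a subset of the mission metrics for one task."""
    # Task complexity: ideal number of concepts needed for this task.
    complexity = log.ideal_concepts
    return {
        "task_efficiency": log.ideal_concepts / log.actual_concepts,
        "dialogue_efficiency": log.ideal_turns / log.actual_turns,
        "task_pace": log.elapsed_seconds / complexity,
        "dialogue_pace": log.actual_turns / complexity,
        "user_frustration": (log.rephrases + log.repeats) / log.relevant_turns,
    }


# Made-up counts for a single task.
log = TaskLog(ideal_concepts=8, actual_concepts=10, ideal_turns=4,
              actual_turns=8, elapsed_seconds=120.0, rephrases=2,
              repeats=1, relevant_turns=6)
metrics = mission_metrics(log)
```

With these sample counts the user needed twice the ideal number of
turns (dialogue efficiency 0.5), and half of the relevant turns were
repairs or repeats (user frustration 0.5).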

Some component performance metrics rely upon measurements
from multiple components. For example, Processing Errors
combines data transfer errors, logic errors, and agent errors.
Those measurements map to the Turn Manager, which controls
the system's dialogue logic; the Mobile Agents, which interface
with data sources; and the Hub, which coordinates component
communication. The metrics are discussed in Table 3.

4.  EVALUATION  PROCESS 

Our LCS systems are built upon MIT’s Galaxy II architecture
[3]. Galaxy II is a distributed, plug and play component-based
architecture in which specialized servers, each handling a
specific task such as translating audio data to text, communicate
through a central server (Hub). The LCS system shown in
Figure 2 includes servers for speech recording and playback
(Audio I/O), speech synthesis (Synthesis), speech recognition
(Recognizer), natural language processing (NL),
discourse/response logic (Turn Manager), and an agent server
(Mobile Agents) for application/database interaction.

We implement a number of diverse applications and serve a
user population that has varying expertise. The combination of
these two factors results in a wide range of expectations of
system performance by users. We have found that the three-tier
system and related evaluation process not only capture those
expectations, but also aid in furthering our development.

Our evaluation process begins with conducting a user study,
typically in the field. We refer to these studies as Integrated
Feasibility Experiments (IFE). Participants involved in the
IFEs are trained to use their particular LCS application by a
member of our development team. The training usually takes 15
to 30 minutes. The training specifies the purpose of the LCS
application in aiding their work, includes a brief description of
the LCS architecture, and details the speech commands and
expected responses through demonstration. After the
introductory instruction and demonstration, participants practice
interacting with the system.


Table 3. Component metrics


Metric                Description                         Measurement
--------------------  ----------------------------------  ------------------------------------
Word Accuracy         System recognition per word         NIST String Alignment and Scoring
                                                          Program

Utterance Accuracy    System recognition per user         Σ recognized turns / Σ turns
                      utterance

Concept Accuracy*     Semantic understanding of the       Σ recognized concepts / Σ concepts
                      system

Component Speed       Speed of various components         time per turn

Processing Errors     Percent of turns with low level     Σ (agent errors + frame construction
                      system error measurements           errors + logic errors) /
                                                          Σ system turns

Language Errors       Percent of turns with errors in     Σ (parse errors + synthesis errors) /
                      sentence construction, word         Σ system turns
                      parsing and spoken output of the
                      system

* Our use of concept accuracy was inspired by the concept accuracy
  metric of the PARADISE methodology [5].

For each study, we develop a set of scenarios based upon our
knowledge of the domain and ask each participant to complete
the scenarios as quickly as they can with maximal accuracy and
minimal moderator assistance. The study usually consists of
approximately five task scenarios of varying difficulty. The
scenarios are carried out in fixed order and are given a time
limit, generally no longer than 30 minutes. The system logs
key events at the Hub, including times and values for the
user’s speech recording, recognition hypotheses, grammatical
parse, resultant query, component speeds, any internal errors,
and the system’s response. In addition, the moderator notes
any assistance or intervention, such as reminding the user of
proper usage or fixing an application error. Once the tasks are
completed, the user fills out a web-based survey and
participates in a brief interview. These determine user
satisfaction with the system.

Upon conclusion of a user study, we extract the log files and
code the users’ recordings through manual transcription. We
add diagnostic tags to the log files, noting such events as
rephrased utterances and causes of errors, and then audit all of
the logs for accuracy and consistency. Some of the diagnostic
tags that we annotate are number of items and concepts within
an utterance, frame construction errors, repeated or rephrased
utterances and deficiencies of the training sentence corpus.
This process is very time-consuming, so it is necessary to
involve multiple people in this phase of the evaluation.
However, one individual is tasked with the final
responsibility of examining the annotations for consistency.

A series of scripts and spreadsheets calculate our metrics from
the log files. These scripts take the log files as parameters and
produce various metric values. While interpreting the metric
values, we may re-examine the log files in detail for particular
tasks or events in order to understand any significant and
surprising results or trends.
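
A script of this kind could be as simple as a tag tally over annotated
log lines. The sketch below is purely illustrative: the line format
(turn id, speaker, optional comma-separated tags) and the tag names are
invented for the example and do not describe the actual LCS log format.

```python
from collections import Counter


def tally_tags(lines):
    """Count diagnostic tags and user turns in annotated log lines.

    Assumed (hypothetical) line format: "<turn id> <speaker> <tag>,<tag>,..."
    where the tag list is absent for turns without annotations.
    """
    counts = Counter()
    user_turns = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[1] == "user":
            user_turns += 1
        if len(fields) > 2:
            counts.update(fields[2].split(","))
    return counts, user_turns


# Made-up annotated log excerpt.
sample = [
    "1 user",
    "2 system",
    "3 user rephrase",
    "4 system parse_error,synthesis_error",
    "5 user repeat",
]
counts, user_turns = tally_tags(sample)
```

A command-line wrapper could open each file named in `sys.argv[1:]` and
pass the file object to `tally_tags`, since the function accepts any
iterable of lines.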

Finally,  through  a  mixture  of  automated  formatting  and  manual 
commentary,  we  create  a  summary  presentation  of  the  user  study 
results.  Web  pages  are  generated  that  contain  some  of  the 
metrics  collected  throughout  the  study. 


5.  APPROACH  VERIFICATION 

We have applied our approach in four separate IFEs to date. In
each case, our metrics revealed areas for improvement. As these
improvements were made, the problems discovered in the next
IFE were more subtle and deeply ingrained within the system.
Mission success and component metrics aided in the
interpretation of user perception and drove future system
development. A top-level summary of the IFEs, metrics and system
improvements follows.

The first IFE was our pilot study, which took place in-house in
September 1999. Five subjects with varying military
experience were asked to complete three tasks, which were scripted
for them. The tier one metrics revealed the users’ dissatisfaction
with the system responses and the time required in receiving
them. These perceptions led to system changes within our
Agent and Turn Manager structures that improved the speed of
our database agents and produced more appropriate responses
from the LCS system.

The  second  IFE  took  place  during  the  Desert  Knight  1999 
Marine  exercise  at  Twentynine  Palms,  CA  in  December  1999. 
Ten  subjects,  each  an  active  duty  Marine  with  varying  radio 
operator  experience,  were  given  five  tasks.  This  user  study 
offered  the  subjects  the  option  of  following  scripts  in  their 
tasks.  The  metrics  of  tier  one  showed  an  increase  in  overall  user 
satisfaction  and  revealed  the  users’  difficulty  using  the  system 
and  anticipating  its  behavior.  These  concerns  influenced  future 
user  training  and  the  development  of  more  explicit  system 
responses. 

The  third  IFE  occurred  during  the  Marine  CAX  6  (Combined 
Arms  Exercise)  at  Twentynine  Palms,  CA  in  April  2000.  The 
seven  subjects  were  active  duty  Marines,  some  with  minimal 
radio  training.  They  were  required  to  complete  five  tasks  that 
had  scenario-based,  non-scripted  dialogues.  A  combination  of 
tier  one,  tier  two  and  tier  three  metrics  exposed  a  deficiency  in 
the  speech  recognition  server,  prompting  us  to  increase 
recognizer  training  for  subsequent  IFEs.  A  recognizer  training 
corpus  builder  was  developed  to  boost  recognition  scores. 

The  most  recent  IFE  was  conducted  in  Gulfport,  MS  during  the 
August  2000  Millennium  Dragon  Marine  exercise.  Six  active 
duty  Marines  with  varied  radio  experience  completed  five 
scenario-based  tasks.  This  time  the  users  expressed  concern 
with  system  understanding  and  ease  of  use  through  the  tier  one 
metrics.  The  tier  three  metrics  revealed  an  error  in  our  natural 
language  module,  which  sometimes  had  been  selecting  the 
incorrect  user  utterance  from  recognizer  output.  This  error  has 
since  been  removed  from  the  system. 

The three-tiered approach organizes analysis of the
interdependence among metrics. It is useful to study the impact
of a metric in one tier against metrics in another tier through
principal component analysis. These statistics do not
necessarily evidence causality, of course, but they do suggest
insightful correlation. This insight exposes the relative
significance of various factors' contribution to particular
assessments of mission success or user satisfaction.
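
As a hedged illustration of this kind of cross-tier analysis, the
sketch below extracts the first principal component of a small table of
per-session metrics using power iteration on the correlation matrix.
The metric columns and all numbers are made up for the example, and the
code assumes that no metric is constant across sessions and that the
starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the unit-length first principal component of `rows`.

    Each row is one session and each column is one metric. Columns are
    standardized (zero mean, unit variance) so that metrics measured on
    different scales contribute comparably.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Correlation matrix of the standardized metrics.
    corr = [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(d)]
            for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical sessions: word accuracy (tier three), task completion
# (tier two) and overall satisfaction on a 1-5 scale (tier one).
sessions = [
    [0.70, 0.60, 2.5],
    [0.80, 0.75, 3.0],
    [0.85, 0.80, 3.5],
    [0.90, 0.95, 4.5],
    [0.95, 0.90, 4.0],
]
loadings = first_principal_component(sessions)
```

Loadings of similar sign and magnitude on metrics from different tiers
suggest that those metrics vary together across sessions, which is a
statement about correlation and not about cause.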

6.  FUTURE  ENHANCEMENTS 

Although this three-tier evaluation process provides useful
metrics, we have identified three improvements that we plan to
incorporate into our process: (1) an annotation aid,
(2) community standardization, and (3) increased automation. The
annotation aid would allow multiple annotators to review
and edit logs independently. With this tool, we could
automatically measure and control cross-annotator consistency,
currently a labor-intensive chore. Community standardization
entails a logging format, an annotation standard, and
calculation tools common to the DARPA Communicator project [4],
several of which have been developed, but we are still working
to incorporate them. The advantage of community
standardization is the benefit from tools developed by peer
organizations and the ability to compare results. Accomplishing
the first two improvements largely leads to the third
improvement, increased automation, because most (if not all)
aspects from measurement through annotation to calculation then
have a controlled format and assistive tools. These planned
improvements will make our evaluation process more reliable and
less time-consuming while simultaneously making it more
controlled and more comparable.
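
One common way to quantify cross-annotator consistency is Cohen's
kappa, which corrects raw agreement for agreement expected by chance.
The text does not say which measure the planned tool would use, so the
choice of kappa, the tag names and the sample annotations below are all
assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random according
    # to their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnostic tags assigned to the same six turns.
annotator_1 = ["repeat", "rephrase", "none", "none", "repeat", "none"]
annotator_2 = ["repeat", "none", "none", "none", "repeat", "rephrase"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1 indicates near-perfect agreement, while a value near 0
indicates agreement no better than chance; items on which annotators
disagree are natural candidates for the final consistency review.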

7.  CONCLUSION 

We have found that structuring evaluation according to the
three tiers described above improves the selection of metrics
and interpretation of results. While the essence of our approach
is domain independent, it does guide the adaptation of metrics
to specific applications. First, the three tiers impose a structure
that selects certain metrics to constitute a broad pragmatic
assessment with minimal data, refining the subject of
evaluation. Second, the three tiers organize metrics so that user
satisfaction and mission metrics have clear normative semantics
(results interpreted as good/bad) and they reveal the impact of
low-level metrics (results tied to particular components which
may be faulted/lauded). Finally, improvements in selection and
interpretation balance satisfaction, effectiveness, and
performance, thus imbuing the evaluation process with focus toward
utility for practical applications of spoken language dialogue.